Abstract

In the framework of supervised classification (discrimination) for functional
data, it is shown that the optimal classification rule can be explicitly obtained for
a class of Gaussian processes with "triangular" covariance functions. This explicit
knowledge has two practical consequences. First, the consistency of the well-known
nearest neighbors classifier (which is not guaranteed in problems with
functional data) is established for the indicated class of processes. Second, and
more important, parametric and nonparametric plug-in classifiers can be obtained
by estimating the unknown elements in the optimal rule.

The performance of these new plug-in classifiers is checked, with positive
results, through a simulation study and a real data example.

1 Introduction 

Statement of the problem. Notation 

Discrimination, also called "supervised classification" in modern terminology, is one
of the oldest statistical problems in experimental science: the aim is to decide whether
a random observation $X$ (taking values in a "feature space" $\mathcal{F}$ endowed with a
distance $D$) belongs to the population $P_0$ or to $P_1$. For example, in a medical
problem $P_0$ and $P_1$ could correspond to the groups of "healthy" and "ill" individuals,
respectively. The decision must be taken from the information provided by a "training
sample" $\mathcal{X}_n = \{(X_i, Y_i), 1 \le i \le n\}$. Here $X_i$, $i = 1, \dots, n$, are
independent replications of $X$, measured on $n$ randomly chosen individuals, and $Y_i$
are the corresponding values of an indicator variable which takes the value 0 or 1
according to the membership of the $i$-th individual in $P_0$ or $P_1$. The term
"supervised" refers to the fact that the individuals in the training sample are supposed
to be correctly classified, typically using "external" non-statistical procedures, so that
they provide a reliable basis for the assignment of the new observation. It is possible
to consider the case where $K > 2$ populations, $P_0, \dots, P_{K-1}$, are involved but, in
what follows, we will restrict ourselves to the binary case $K = 2$.

The mathematical problem is to find a "classifier" (or "classification rule")
$g_n(x) = g_n(x; \mathcal{X}_n)$, with $g_n : \mathcal{F} \to \{0, 1\}$, that minimizes the
classification error $P\{g_n(X) \neq Y\}$. It is not difficult to prove (e.g., Devroye
et al., 1996, p. 11) that the optimal classification rule (often called "Bayes rule") is

$$g^*(x) = I_{\{\eta(x) > 1/2\}}(x), \tag{1}$$

where $\eta(x) = E(Y | X = x)$ and $I_A$ stands for the indicator function of a set
$A \subset \mathcal{F}$. Of course, since $\eta$ is unknown the exact expression of this rule
is usually unknown, and thus different procedures have been proposed to approximate
$g^*$ using the training data.

From now on we will use the following notation. Let $\mu_i$ be the distribution of $X$
conditional on $Y = i$, that is, $\mu_i(B) = P\{X \in B | Y = i\}$ for $B \in \mathcal{B}_{\mathcal{F}}$
(the Borel $\sigma$-algebra on $\mathcal{F}$) and $i = 0, 1$. We denote by $S_i \subset \mathcal{F}$
the support of $\mu_i$, for $i = 0, 1$, $S = S_0 \cap S_1$ and $p = P\{Y = 0\}$ (we assume
$0 < p < 1$). Given two measures $\mu$ and $\nu$, the expression $\mu \ll \nu$ denotes that
$\mu$ is absolutely continuous with respect to $\nu$ (i.e., $\nu(B) = 0$ implies $\mu(B) = 0$).

The notation $C[0, 1]$ stands for the space of real continuous functions on the interval
$[0, 1]$ endowed with the usual supremum norm, denoted by $\|\cdot\|$. The subspace of
functions of class 2 (i.e. with two continuous derivatives) is denoted by $C^2[0, 1]$.

Finite dimensional spaces. Three classical discrimination procedures 

The origin of the discrimination problem goes back to the classical work by Fisher
(1936) where, in the $d$-variate framework $\mathcal{F} = \mathbb{R}^d$, a simple "linear
classifier" of type $g_n(x) = I_{\{w'x + w_0 > 0\}}$ was introduced for the case that both
populations $P_0$ and $P_1$ are homoscedastic, that is, have a common covariance matrix
$\Sigma$. Intuitively, $w'x + w_0 = 0$ is chosen as the affine hyperplane which provides
the "maximum separation" between both populations. It is well known (see, e.g., Duda
et al. 2000 for details) that the expression of Fisher's rule turns out to depend on
the inverse $\Sigma^{-1}$ of the covariance matrix. It is also known that Fisher's linear
rule is in fact the optimal one (1) when the conditional distributions of $X|Y = 0$ and
$X|Y = 1$ are homoscedastic normals and all the means and covariances are known. These
conditions look quite restrictive but, as argued by Hand (2006) in a provocative paper,
Fisher's rule (or rather its sampling approximation obtained by estimating the unknown
parameters) is hard to beat in practical examples. That is, while it is not difficult
to construct examples where this rule outrageously fails, its performance is quite good
in most cases found in real-life examples. For this reason, Fisher's linear rule is
still the most popular classification tool among practitioners, in spite of the
intensive subsequent research on this topic. Thus, in a way, Fisher's rule represents
a sort of "gold standard" in the multivariate statistical discrimination problem.

The books by Devroye et al. (1996), Duda et al. (2000) and Hastie et al. (2001)
offer different interesting perspectives of the work done in discrimination theory since
Fisher's pioneering paper. All of them focus on the standard multivariate case
$\mathcal{F} = \mathbb{R}^d$. Many classifiers have been proposed as an alternative to
Fisher's linear rule in this finite-dimensional setup. One of the simplest and easiest
to motivate is the so-called $k$-nearest neighbors method. Given a fixed positive integer
value (or smoothing parameter) $k = k_n$, this rule simply classifies an incoming
observation $x$ in the population $P_1$ if the majority among the $k$ training
observations closest to $x$ (with respect to the considered distance $D$) belong to
$P_1$. More concretely, the $k$-NN rule can be defined by

$$g_n(x) = I_{\{\eta_n(x) > 1/2\}}, \tag{2}$$

where

$$\eta_n(x) = \frac{1}{k} \sum_{i=1}^{n} I_{\{X_i \in k(x)\}} Y_i \tag{3}$$

and "$X_i \in k(x)$" means that $X_i$ is one of the $k$ nearest neighbors of $x$.
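As an illustration, the rule (2)-(3) can be sketched in a few lines of Python for curves observed on a common grid. This is only a sketch under our own assumptions: the function names `sup_distance` and `knn_classify` are hypothetical, the distance $D$ is taken to be the supremum distance between discretized trajectories, and ties between equidistant neighbors are broken by sample order.

```python
from typing import List, Sequence


def sup_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Supremum distance between two curves sampled on the same grid."""
    return max(abs(a - b) for a, b in zip(x, y))


def knn_classify(x: Sequence[float],
                 train_x: List[Sequence[float]],
                 train_y: List[int],
                 k: int) -> int:
    """k-NN rule: majority vote among the k training curves closest to x."""
    # Training indices sorted by distance to x (stable sort: ties keep sample order).
    order = sorted(range(len(train_x)),
                   key=lambda i: sup_distance(x, train_x[i]))
    # Regression estimate (3): proportion of label 1 among the k nearest neighbors.
    eta_n = sum(train_y[i] for i in order[:k]) / k
    # Plug-in rule (2).
    return 1 if eta_n > 0.5 else 0
```

Any other metric on the function space could be passed in place of `sup_distance`; the choice of the supremum norm here simply mirrors the norm used for $C[0,1]$ in the text.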

In fact, the definition of the $k$-NN rule is extremely simple and can be introduced
(in terms of "majority vote among the neighbors") with no explicit reference to any
regression estimator. However, the idea of replacing the unknown regression function
$\eta(x)$ in the optimal classifier (1) with a regression estimator (given by (3) in the
case of the $k$-NN rule) is very natural. It suggests a general methodology to construct
a wide class of classifiers by just plugging in different regression estimators $\eta_n$
in (1) instead of the true regression function $\eta(x)$. In the finite-dimensional case
$\mathcal{F} = \mathbb{R}^d$ this is a particularly fruitful idea, as a wealth of different
(parametric and nonparametric) estimators of $\eta(x)$ is available; see Audibert and
Tsybakov (2007) for some reasons in favor of the plug-in methodology in classification.
The main purpose of this work is to show that the plug-in methodology can also be
successfully used for classification in some functional data models.

Discrimination of functional data. Differences with the finite-dimensional case

We are concerned here with the problem of (binary) supervised classification with
functional data. That is, we assume throughout that the space $(\mathcal{F}, D)$ where the
data $X_i$ live is a separable metric space (typically a space of functions). For some
theoretical results, considered below, we will impose more specific assumptions on
$\mathcal{F}$.

The study of discrimination techniques with functional data is not as developed as the 
corresponding finite-dimensional theory but, clearly, is one of the most active research 
topics in the booming field of functional data analysis (FDA). Two well-known books 
including broad overviews of FDA with interesting examples are Ferraty and Vieu (2006) 
and Ramsay and Silverman (2005). A recent survey on supervised and unsupervised 
classification with functional data can be found in Baíllo et al. (2009).

While the formal statement of the functional classification problem is very much 
the same as that indicated at the beginning of this section, there are some important 
differences with the classical finite-dimensional case. 

(a) Lack of a simple functional version of Fisher's linear rule: As mentioned above, the
idea behind Fisher's rule requires inverting the covariance operator. When
$\mathcal{F} = \mathbb{R}^d$ this is increasingly difficult as the dimension $d$ increases,
but it becomes impossible in the functional framework where the operator is typically
not invertible. Thus the applicability of Fisher's linear methodology to functional
data is a non-trivial issue of current research interest. See, for instance, James and
Hastie (2001) and Shin (2008) for interesting adaptations of linear discrimination
ideas to a functional setting.

(b) Difficulty to implement the plug-in idea: Unlike the finite-dimensional case, the
plug-in methodology is not generally considered as a standard procedure to construct
functional classifiers. When $x$ is infinite-dimensional, there are as yet few simple
parametric models giving a good fit to the regression function, and the structure of
nonparametric estimators of $\eta$ is relatively complicated.

(c) The $k$-NN functional classifier is not universally consistent: In the discrimination
problem a sequence of classifiers $\{g_n\}$, based on samples of size $n$, is said to be
"consistent" when the corresponding sequence of classification errors converges, as
$n$ tends to infinity, to the "lowest possible error" attained by the Bayes classifier
(1); see Section 3 below for more details. It turns out (see Stone, 1977) that, in
the case of finite-dimensional data $X_i \in \mathbb{R}^d$, any sequence of $k$-NN
classifiers is consistent provided that $k_n \to \infty$ and $k_n/n \to 0$. Since such
consistency holds irrespectively of the distribution of the data $(X, Y)$, this property
is called "universal consistency".

The definition of the $k$-NN classifier can be easily translated to the functional
setup (by replacing the usual Euclidean distance in $\mathbb{R}^d$ with an appropriate
functional metric $D$). However, the universal consistency is lost. Cérou and Guyader
(2006, Th. 2) have obtained sufficient conditions for consistency of the $k$-NN
classifier when $X$ takes values in a separable metric space. Nevertheless, the
required assumptions are not always trivial to check. As the $k$-NN rule is a natural
"default choice" in infinite-dimensional setups, an important issue is to ensure its
consistency, at least for some functional models of practical interest.

The purpose and structure of this paper 

This work aims to partially fill the gaps pointed out in points (b) and (c) of
the above paragraph. To this end, in Subsection 2.1 a simple expression is obtained
for the Bayes (optimal) rule $g^*$ in the case that both distributions, $\mu_0$ and
$\mu_1$, are equivalent. However, $g^*$ turns out to depend on the Radon-Nikodym
derivative $d\mu_0/d\mu_1$, which is usually unknown, or has an extremely involved
expression, even when $\mu_0$ and $\mu_1$ are completely known. An interesting exception
is given by Gaussian processes with a specific type of covariance functions, called
"triangular". For these processes the Radon-Nikodym derivative has been explicitly
calculated by Varberg (1961) and Jørsboe (1968), whose results are collected and
briefly commented on in Subsection 2.2. In Subsection 2.3 parametric plug-in estimators
for $g^*$ are obtained by assuming that $\mu_0$ and $\mu_1$ are either (parametric)
Brownian motions or Ornstein-Uhlenbeck processes. Nonparametric plug-in estimators for
$g^*$ are proposed and analyzed in Subsection 2.4 under the sole assumption that the
covariance functions are triangular. Since the proofs of the results in this subsection
are rather technical, they are deferred to a final appendix. This concludes our
contributions regarding issue (b). Section 3 is devoted to the $k$-NN consistency
problem introduced in (c): we use the above-mentioned result by Cérou and Guyader
(2006) to show that the $k$-NN rule is consistent in functional classification problems
where the data are generated by certain Gaussian triangular processes specified in
Subsection 2.2.

Finally, in Section 4 the practical performance of the plug-in rules proposed in Section
2 is checked, and compared with the $k$-NN rule, through a simulation study and the
analysis of a real data example.

2 The optimal classifier for a Gaussian family 

2.1 A general expression based on Radon-Nikodym derivatives 

When the distributions $\mu_0$ and $\mu_1$ of $P_0$ and $P_1$ are both absolutely
continuous with respect to some common $\sigma$-finite measure $\mu$, it is easy to see,
as a consequence of Bayes formula, that the optimal rule is

$$g^*(x) = I_{\{(1-p) f_1(x) > p f_0(x)\}}, \tag{4}$$

where $p = P\{Y = 0\}$ and $f_0$, $f_1$ are the $\mu$-densities of $P_0$ and $P_1$,
respectively.

The expression (4) is particularly important in finite-dimensional problems with
$\mathcal{F} = \mathbb{R}^d$, where the Lebesgue measure $\mu$ arises as the natural
reference measure and the corresponding Lebesgue densities can be estimated in many
ways. In infinite-dimensional spaces there is no such obvious dominant measure.
However, if we assume that $\mu_0$ and $\mu_1$, with supports $S_0$ and $S_1$, are
absolutely continuous with respect to each other on $S_0 \cap S_1$, the optimal rule
can also be expressed in a simple way in terms of the Radon-Nikodym derivative
$d\mu_0/d\mu_1$, as shown in the following result.

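Theorem 1 Assume that $\mu_0 \ll \mu_1$ and $\mu_1 \ll \mu_0$ on $S = S_0 \cap S_1$. Then

$$\eta(x) = \begin{cases} 0 & \text{if } x \in S_0 \cap S^c, \\ 1 & \text{if } x \in S_1 \cap S^c, \\ \dfrac{1-p}{p\,\frac{d\mu_0}{d\mu_1}(x) + 1 - p} & \text{if } x \in S \end{cases} \tag{5}$$

provides the expression for the optimal rule $g^*(x) = I_{\{\eta(x) > 1/2\}}$.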

Proof: Define $\mu = \mu_0 + \mu_1$. Then $\mu_i \ll \mu$, for $i = 0, 1$, and we can define
the Radon-Nikodym derivatives $f_i = d\mu_i/d\mu$, for $i = 0, 1$. From the definition of
the conditional expectation we know that $\eta(x) = E(Y|X = x) = P(Y = 1|X = x)$ can be
expressed by

$$\eta(x) = \frac{f_1(x)(1-p)}{f_0(x)\,p + f_1(x)(1-p)}. \tag{6}$$

Observe that $\mu|_{S^c \cap S_i} = \mu_i|_{S^c \cap S_i}$ and thus $f_i = 1$ and $f_{1-i} = 0$
on $S^c \cap S_i$, for $i = 0, 1$. Since $\mu_0 \ll \mu_1$ and $\mu_1 \ll \mu_0$ on $S$ then,
on this set, there exist the Radon-Nikodym derivatives $d\mu_0/d\mu_1$ and
$d\mu_1/d\mu_0$. In this case, it also holds that $\mu|_S \ll \mu_i|_S$, for both
$i = 0, 1$, and

$$\frac{d\mu}{d\mu_i}(x) = 1 + \frac{d\mu_{1-i}}{d\mu_i}(x), \quad \text{for any } x \in S.$$

Then (see, e.g., Folland 1999), for $i = 0, 1$ and for $P_X$-a.e. $x \in S$,

$$f_i(x) = \frac{d\mu_i}{d\mu}(x) = \left(\frac{d\mu}{d\mu_i}(x)\right)^{-1} = \frac{1}{1 + \frac{d\mu_{1-i}}{d\mu_i}(x)}. \tag{7}$$

Substituting (7) into expression (6) we get (5). □
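On the common support $S$, the regression function depends on $x$ only through the value $r = d\mu_0/d\mu_1(x)$, namely $\eta(x) = (1-p)/(p\,r + 1 - p)$, so that $\eta(x) > 1/2$ exactly when $r < (1-p)/p$. The following Python sketch makes this explicit; the function names are hypothetical and the value `rn` of the Radon-Nikodym derivative is assumed to be supplied by the user.

```python
def eta_from_rn(rn: float, p: float) -> float:
    """P(Y = 1 | X = x) on the common support, given rn = dmu0/dmu1(x) and p = P(Y = 0)."""
    return (1.0 - p) / (p * rn + 1.0 - p)


def bayes_rule(rn: float, p: float) -> int:
    """Optimal rule: assign x to population 1 when eta(x) > 1/2."""
    return 1 if eta_from_rn(rn, p) > 0.5 else 0
```

With equal priors ($p = 1/2$) the rule reduces to comparing the Radon-Nikodym derivative with 1.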

The mutual absolute continuity is not a very restrictive assumption if we deal with
Gaussian measures. According to a well-known result by Feldman and Hájek (see
Feldman, 1958), for any given pair of Gaussian processes there is a dichotomy in such a
way that they are either equivalent or mutually singular. In the first case both
measures $\mu_0$ and $\mu_1$ have a common support $S$. As for the identification of the
support, Vakhania (1975) has proved that if a Gaussian process, with trajectories in a
separable Banach space $\mathcal{F}$, is not degenerate (i.e., the distribution of any
non-trivial linear continuous functional is not degenerate) then the support of such a
process is the whole space $\mathcal{F}$.

In any case, expression (5) would be of no practical use unless some expressions,
reasonably easy to estimate, can be found for the Radon-Nikodym derivative
$d\mu_0/d\mu_1$. This issue is considered in the next subsection.

2.2 Explicit expression for a family of Gaussian distributions 

The best known Gaussian process is perhaps the standard Brownian motion
$\{W(t), t \ge 0\}$, for which $E(W(t)) = 0$ and the covariance function is
$\mathrm{Cov}(W(s), W(t)) := \Gamma(s,t) = \min(s,t)$. A wide class of Brownian-type
processes can be obtained by location and scale changes of type $m(t) + \sigma W(t)$,
where $m(t)$ is a given mean function and $\sigma > 0$.

In fact, the covariance structure $\Gamma(s,t) = \min(s,t)$ can be generalized to define
a much broader class of processes with $\Gamma(s,t) = u(\min(s,t))\, v(\max(s,t))$, where
$u$ and $v$ denote suitable real functions. Covariance functions of this type are called
triangular. They have received considerable attention in the literature. For example,
Sacks and Ylvisaker (1966) use this condition in the study of optimal designs for
regression problems where the errors are generated by a zero mean process with
covariance function $\Gamma(s,t)$. It turns out that the Hilbert space with reproducing
kernel $\Gamma$ plays an important role in the results and, as these authors point out,
the norm of this space is particularly easy to handle when $\Gamma$ is triangular. On the
other hand, Varberg (1964) has given an interesting representation of the processes
$X(t)$, $0 \le t \le b$, with zero mean and triangular covariance function. This author
proved that they can be expressed in the form $X(t) = \int_0^b W(u)\, d_u R(t,u)$, where
$W$ is the standard Wiener process and $R = R(t,u)$ is a function, of bounded variation
with respect to $u$, defined in terms of $\Gamma$.

The so-called Ornstein-Uhlenbeck model, for which $\Gamma(s,t) = \sigma^2 \exp(-\beta|s-t|)$
($\beta, \sigma > 0$), provides another important class of processes with triangular
covariance functions. They are widely used in physics and finance.
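The triangular structure is easy to explore numerically. The following Python sketch (the function name `triangular_cov` and the numerical values are ours, chosen only for illustration) evaluates $u(\min(s,t))\,v(\max(s,t))$ on a grid for two cases: $u(t) = t$, $v \equiv 1$, which gives the Brownian covariance $\min(s,t)$, and, as one possible factorization assumed here, $u(t) = \sigma^2 e^{\beta t}$, $v(t) = e^{-\beta t}$, which gives the Ornstein-Uhlenbeck covariance $\sigma^2 e^{-\beta|s-t|}$.

```python
import math


def triangular_cov(u, v, grid):
    """Matrix of Gamma(s, t) = u(min(s, t)) * v(max(s, t)) over a grid of time points."""
    return [[u(min(s, t)) * v(max(s, t)) for t in grid] for s in grid]


grid = [0.0, 0.25, 0.5, 1.0]

# Standard Brownian motion: u(t) = t and v(t) = 1 give Gamma(s, t) = min(s, t).
brownian = triangular_cov(lambda t: t, lambda t: 1.0, grid)

# Ornstein-Uhlenbeck covariance sigma^2 * exp(-beta * |s - t|), illustrative parameters.
beta, sigma = 2.0, 1.5
ou = triangular_cov(lambda t: sigma ** 2 * math.exp(beta * t),
                    lambda t: math.exp(-beta * t),
                    grid)
```

Note that the pair $(u, v)$ is only determined up to a multiplicative constant, since $(cu, v/c)$ yields the same covariance for any $c > 0$.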

The following theorem is due to Varberg (1961, Th. 1) and Jørsboe (1968, p.
61). It shows that the Radon-Nikodym derivative can be expressed in a closed,
relatively simple way for these special classes of Gaussian processes. For more
information concerning explicit expressions of Radon-Nikodym derivatives for Gaussian
processes see Segall and Kailath (1975) and references therein. From now on let us
denote $m_i(t) = E(X(t)|Y = i)$.

Theorem 2 Let $(\mathcal{F}, D) = (C[0, 1], \|\cdot\|)$. Assume that $X|Y = i$, for
$i = 0, 1$, are Gaussian processes on $[0,1]$, with covariance functions
$\Gamma_i(s,t) = u_i(\min(s,t))\, v_i(\max(s,t))$, for $s, t \in [0,1]$, where $u_i, v_i$,
for $i = 0, 1$, are positive functions in $C^2[0, 1]$. Assume also that $v_i$, for
$i = 0, 1$, and $v_1 u_1' - u_1 v_1'$ are bounded away from zero on $[0, 1]$, that
$u_1 v_1' - u_1' v_1 = u_0 v_0' - u_0' v_0$ and that $u_1(0) = 0$ if and only if
$u_0(0) = 0$.

a) Assume that $m_i \equiv 0$, for $i = 0, 1$. Then there exist some constants $C_1$,
$C_2$, $C_3$ and a function $F$, whose expressions are given in the proof, such that

$$\frac{d\mu_0}{d\mu_1}(x) = C_1 \exp\left\{ \frac{1}{2}\left[ C_3\, x^2(0) + C_2\, x^2(1) - \int_0^1 \frac{x^2(t)}{v_0(t) v_1(t)}\, dF(t) \right] \right\}. \tag{8}$$

b) Assume now that the covariance functions are identical, i.e. $u_i = u$ and $v_i = v$
for $i = 0, 1$, that $m_1 \equiv 0$ and that $m_0$ is a function $m \in C^2[0, 1]$ such that
$m(0) = 0$ whenever $u(0) = 0$. Then there exist some constants $D_1$, $D_2$ and a
function $G$, whose expressions are given in the proof, such that

$$\frac{d\mu_0}{d\mu_1}(x) = \exp\left\{ D_1 + D_2\, x(0) + \int_0^1 G(t)\, d\!\left( \frac{x(t)}{v(t)} \right) \right\}. \tag{9}$$

Proof:

a) Varberg (1961, Th. 1) shows that, under the assumptions of (a), $\mu_0$ and $\mu_1$ are
equivalent measures. The Radon-Nikodym derivative of $\mu_0$ with respect to $\mu_1$ is

$$\frac{d\mu_0}{d\mu_1}(x) = C_1 \exp\left\{ \frac{1}{2}\left[ C_4\, x^2(0) + \int_0^1 F(t)\, d\!\left( \frac{x^2(t)}{v_0(t) v_1(t)} \right) \right] \right\}, \tag{10}$$

where

$$C_1 = \begin{cases} \left[ \dfrac{v_0(0) v_1(1)}{v_0(1) v_1(0)} \right]^{1/2} & \text{if } u_0(0) = 0, \\[3mm] \left[ \dfrac{u_1(0) v_1(1)}{v_0(1) u_0(0)} \right]^{1/2} & \text{if } u_0(0) \neq 0, \end{cases}
\qquad
C_4 = \begin{cases} 0 & \text{if } u_0(0) = 0, \\[3mm] \dfrac{v_0(0) u_0(0) - u_1(0) v_1(0)}{v_1(0) v_0(0) u_0(0) u_1(0)} & \text{if } u_0(0) \neq 0, \end{cases}$$

and $F = (v_1 v_0' - v_0 v_1')/(v_1 u_1' - u_1 v_1')$.

Observe that, by the assumptions of the theorem, $F$ is differentiable with bounded
derivative. Thus $F$ is of bounded variation and it may be expressed as the difference
of two bounded positive increasing functions. Therefore the stochastic integral (10)
is well defined and it can be evaluated integrating by parts, leading to conclusion
(8), with $C_3 = C_4 - F(0)/(v_0(0) v_1(0))$ and $C_2 = F(1)/(v_0(1) v_1(1))$.

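b) In Jørsboe (1968), p. 61, it is proved that, under the indicated assumptions, $\mu_0$
and $\mu_1$ are equivalent measures with Radon-Nikodym derivative

$$\frac{d\mu_0}{d\mu_1}(x) = \exp\left\{ D_3 + D_2\, x(0) + \frac{1}{2} \int_0^1 G(t)\, d\!\left( \frac{2x(t) - m(t)}{v(t)} \right) \right\},$$

with

$$D_3 = -\frac{m^2(0)}{2\, u(0) v(0)}, \qquad D_2 = \frac{m(0)}{u(0) v(0)} \qquad \text{if } u(0) \neq 0$$

(and $D_3 = D_2 = 0$ if $u(0) = 0$), and $G = (v m' - m v')/(v u' - u v')$. Again,
integration by parts gives (9), where $D_1 = D_3 - \frac{1}{2}\int_0^1 G\, d(m/v)$. □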

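In the general case where $m_0 \neq m_1$ and $\Gamma_0 \neq \Gamma_1$, let us denote by
$P_{m,\Gamma}$ the distribution of the Gaussian process with mean $m$ and covariance
function $\Gamma$. Then, applying the chain rule for Radon-Nikodym derivatives (see,
e.g., Folland, 1999) we get

$$\frac{d\mu_0}{d\mu_1}(x) = \frac{dP_{m_0,\Gamma_0}}{dP_{m_1,\Gamma_1}}(x) = \frac{dP_{m_0,\Gamma_0}}{dP_{0,\Gamma_0}}(x)\, \frac{dP_{0,\Gamma_0}}{dP_{0,\Gamma_1}}(x)\, \frac{dP_{0,\Gamma_1}}{dP_{m_1,\Gamma_1}}(x). \tag{11}$$

Under the appropriate assumptions the expressions of the Radon-Nikodym derivatives
in the right-hand side of (11) are given in (8) and (9).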



2.3 Parametric plug-in rules 

The aim of this subsection is twofold. First and foremost, we show how the theoretical
results of Subsections 2.1 and 2.2 become useful in practice. To this end, we consider
examples of well-known Gaussian processes that fulfill the requirements of Theorems
1 and 2, namely Brownian motions with drift and Ornstein-Uhlenbeck processes. We
derive the expressions of the Radon-Nikodym derivatives $d\mu_0/d\mu_1$ for these
examples. Then, it is straightforward to compute the Bayes rule $g^*$ for classification
between two elements of one of these families. In these particular examples the mean
and variance of the Gaussian process $X|Y = i$ have known parametric expressions (up to
a finite number of parameters). Thus $g^*$ is completely specified as long as the
parameters have known values. When this is not the case, we can substitute each unknown
parameter in $g^*$ by some estimate. The resulting discrimination procedure is called
the parametric plug-in rule. In particular, for the Bayes rules given in (12), (13),
(14) and (15) below the explicit expression of the parameter estimates is given in the
appendix.

The second objective of Subsection 2.3 is to obtain the expressions of the Bayes
rules for the models used in Section 4 and to derive the corresponding parametric
plug-in versions.

Two Brownian motions 

Let us denote $X(t;i) = (X(t)|Y = i)$. In the Brownian case, using the standard
notation in stochastic differential equations, $X(t;i)$ is just the solution of
$dX(t;i) = m_i'(t)\, dt + \sigma_i\, dW_i(t)$, for $i = 0, 1$ and $t \in [0, 1]$. Here
$m_1 \equiv 0$, $m_0(t) = ct$, where $0 < c < \infty$ is a constant, $W_0$ and $W_1$ are two
uncorrelated Brownian motions and $X(0;i) \sim N(0, \theta_i^2)$. Then, if
$\sigma_0 = \sigma_1 = \sigma$, the conditions of Theorem 2 are satisfied with
$u_i(t) = \theta_i^2 + \sigma^2 t$ and $v_i \equiv 1$, for $i = 0, 1$.

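When $\theta_0 = \theta_1 = 0$, we have $X(0;i) = 0$ and, for any $x \in S$,

$$\frac{d\mu_0}{d\mu_1}(x) = \exp\left\{ \frac{c}{2\sigma^2}\, (2x(1) - c) \right\}.$$

Thus the Bayes rule is

$$g^*(x) = I_{\{x(1) < c/2 - (\sigma^2/c) \log(p/(1-p))\}}. \tag{12}$$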

If $\theta_i \neq 0$ for $i = 0, 1$, then $X(0;i)$ is random and a similar calculation
yields that the Bayes rule classifies $x$ in population $P_1$ whenever

$$\frac{c}{2\sigma^2} \{ 2(x(1) - x(0)) - c \} + \frac{1}{2}\left( \frac{1}{\theta_1^2} - \frac{1}{\theta_0^2} \right) x^2(0) < \log \frac{(1-p)\, \theta_0}{p\, \theta_1}. \tag{13}$$

Replacing the unknown parameters $c$, $\sigma$ and $\theta_i$ in (12) and (13) by
estimates, we obtain the corresponding parametric plug-in rules.
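To fix ideas, the following Python sketch implements a plug-in rule for the case $X(0;i) = 0$, where $\log (d\mu_0/d\mu_1)(x) = \frac{c}{2\sigma^2}(2x(1) - c)$ and, by Theorem 1, $x$ is assigned to $P_1$ when this quantity is smaller than $\log((1-p)/p)$. The estimators used here are our own illustrative assumptions and need not coincide with those given in the appendix: $c$ is estimated by the average endpoint of the curves from $P_0$, $\sigma^2$ by the pooled sum of squared (drift-corrected) increments, and $p$ by the sample proportion. Curves are assumed to be observed on a uniform grid of $[0,1]$, and all function names are hypothetical.

```python
import math


def fit_brownian_plugin(curves0, curves1):
    """Estimate (c, sigma^2, p) from curves of P0 (drift c) and P1 (no drift)."""
    n0, n1 = len(curves0), len(curves1)
    # m_0(1) = c, so the mean endpoint of the P0 curves estimates c.
    c_hat = sum(x[-1] for x in curves0) / n0

    def quad_var(x, drift):
        # Sum of squared increments after removing the drift per grid step.
        steps = len(x) - 1
        return sum((x[j + 1] - x[j] - drift / steps) ** 2 for j in range(steps))

    total = sum(quad_var(x, c_hat) for x in curves0)
    total += sum(quad_var(x, 0.0) for x in curves1)
    sigma2_hat = total / (n0 + n1)
    p_hat = n0 / (n0 + n1)  # estimate of P(Y = 0)
    return c_hat, sigma2_hat, p_hat


def classify_brownian(x, c_hat, sigma2_hat, p_hat):
    """Return 1 when the estimated log Radon-Nikodym derivative is below log((1-p)/p)."""
    log_rn = c_hat / (2.0 * sigma2_hat) * (2.0 * x[-1] - c_hat)
    return 1 if log_rn < math.log((1.0 - p_hat) / p_hat) else 0
```

Only the endpoint $x(1)$ of a new curve enters the decision, which reflects the fact that, for a common $\sigma$, the two models differ only through a linear drift.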

When $\sigma_0 \neq \sigma_1$, then $u_i(t) = \theta_i^2 + \sigma_i^2 t$, $v_i \equiv 1$, for
$i = 0, 1$, and the hypothesis $u_1 v_1' - u_1' v_1 = u_0 v_0' - u_0' v_0$ in Theorem 2 is
not satisfied. In fact, if this last equality does not hold, by Theorem 1 in Varberg
(1961) we know that $\mu_0$ and $\mu_1$ are mutually singular.

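Two Ornstein-Uhlenbeck processes

Let $X|Y = i$, for $i = 0, 1$, be Ornstein-Uhlenbeck processes given by

$$dX(t;i) = -\beta_i\, (X(t;i) - \eta_i)\, dt + \sqrt{2\beta_i}\, \sigma_i\, dW_i(t),$$

where $W_0$ and $W_1$ are two independent Brownian motions and $\beta_i > 0$,
$\sigma_i > 0$, $\eta_i$ are constants.

If $X(0;i)$ is equal to a constant $c_i$, we have that
$m_i(t) = \eta_i + (c_i - \eta_i) e^{-\beta_i t}$ and
$\Gamma_i(s,t) = \sigma_i^2\, (e^{-\beta_i |s-t|} - e^{-\beta_i (s+t)})$. Fixing $v_i(1) = 1$, we
get $u_i(t) = \sigma_i^2 e^{-\beta_i} (e^{\beta_i t} - e^{-\beta_i t})$ and
$v_i(t) = e^{\beta_i (1-t)}$ for $i = 0, 1$. The condition
$u_1 v_1' - u_1' v_1 = u_0 v_0' - u_0' v_0$ in Theorem 2 is fulfilled if and only if
$\beta_0 \sigma_0^2 = \beta_1 \sigma_1^2$. Also, since $u_i(0) = 0$, then $m_i(0) = c_i$ has
to be 0 for $i = 0, 1$. Then it is straightforward to check that the Bayes rule $g^*$
classifies $x$ in population $P_1$ if

$$\begin{aligned} 0 > {} & 2\left( \beta_0^2 (\sigma_0^2 - \eta_0^2) - \beta_1^2 (\sigma_1^2 - \eta_1^2) \right) + 4 x(1) (\eta_0 \beta_0 - \eta_1 \beta_1) + (\beta_1 - \beta_0)\, x^2(1) \\ & + 4 (\eta_0 \beta_0^2 - \eta_1 \beta_1^2) \int_0^1 x(t)\, dt + (\beta_1^2 - \beta_0^2) \int_0^1 x^2(t)\, dt. \end{aligned} \tag{14}$$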
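The compatibility condition of Theorem 2 can be checked numerically for this model. For an Ornstein-Uhlenbeck process started at a constant, with $u(t) = \sigma^2 e^{-\beta}(e^{\beta t} - e^{-\beta t})$ and $v(t) = e^{\beta(1-t)}$, a direct computation gives $u v' - u' v = -2\beta\sigma^2$ for every $t$, so two such processes satisfy the condition exactly when $\beta_0\sigma_0^2 = \beta_1\sigma_1^2$. The Python sketch below (function names and parameter values are ours) approximates the derivatives by central differences.

```python
import math


def ou_uv(beta, sigma):
    """Functions u and v for an O-U process started at a constant, with v(1) = 1."""
    def u(t):
        return sigma ** 2 * math.exp(-beta) * (math.exp(beta * t) - math.exp(-beta * t))

    def v(t):
        return math.exp(beta * (1.0 - t))

    return u, v


def wronskian(u, v, t, h=1e-5):
    """Central-difference approximation of u(t) v'(t) - u'(t) v(t)."""
    du = (u(t + h) - u(t - h)) / (2.0 * h)
    dv = (v(t + h) - v(t - h)) / (2.0 * h)
    return u(t) * dv - du * v(t)


# Two parameter choices with beta * sigma^2 = 4 in both cases (illustrative values).
u0, v0 = ou_uv(beta=1.0, sigma=2.0)
u1, v1 = ou_uv(beta=4.0, sigma=1.0)
```

Evaluating `wronskian(u0, v0, t)` and `wronskian(u1, v1, t)` at any interior `t` should return values close to $-8$ for both parameter choices, in agreement with the closed form.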



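When $X(0;i)$ is random, it follows a normal distribution with mean $\eta_i$ and variance
$\sigma_i^2$. Then $m_i(t) = \eta_i$, for all $t \in [0, 1]$, and
$\Gamma_i(s,t) = \sigma_i^2 e^{-\beta_i |s-t|}$, $u_i(t) = \sigma_i^2 e^{-\beta_i (1-t)}$ and
$v_i(t) = e^{\beta_i (1-t)}$. Consequently, the Bayes rule assigns $x$ to population $P_1$ if

$$\begin{aligned} 2 \beta_1 \sigma_1^2 (\log(\beta_1) - \log(\beta_0)) > {} & 2\left[ \beta_0^2 \sigma_0^2 - \beta_1^2 \sigma_1^2 + \beta_1 \eta_1^2 (1 + \beta_1) - \beta_0 \eta_0^2 (1 + \beta_0) \right] \\ & + 4 x(1) (\eta_0 \beta_0 - \eta_1 \beta_1) + 4 (\eta_0 \beta_0^2 - \eta_1 \beta_1^2) \int_0^1 x(t)\, dt \\ & + (\beta_1 - \beta_0) \left[ x^2(0) + x^2(1) + (\beta_1 + \beta_0) \int_0^1 x^2(t)\, dt \right]. \end{aligned} \tag{15}$$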

The parametric plug-in classification rule is derived by substituting the unknown
parameters $\eta_i$ and $\sigma_i$, $i = 0, 1$, in (14) and (15) with their corresponding
estimators.

2.4 Nonparametric plug-in rules 

In this section we analyze the situation in which the processes ultimately belong to
the Gaussian family fulfilling the conditions of Theorem 2, but we do not place any
parametric assumption on the mean and the covariance functions. However, let us note
that, until we get to the estimation of the Radon-Nikodym derivatives, the Gaussianity
assumption is not needed. Specifically, we only assume that the covariance functions of
the involved processes are of type $\Gamma(s,t) = u(\min(s,t))\, v(\max(s,t))$, for some
(unknown) real functions $u$, $v$ where $v$ is bounded away from 0 on the interval
$[0, 1]$.

Observe that, in order to use a plug-in version of the optimal classification rule
along the lines of Theorems 1 and 2, we need to estimate the functions $m$, $u$ and
$v$ as well as their first and second derivatives. Since these estimation problems have
some independent interest, in this subsection we consider them in a general setup, not
necessarily linked to the classification problem. Thus we use the ordinary iid sampling
model with a fixed sample size denoted, for simplicity, by $n$ in all cases.

Regarding $u$ and $v$, let us note that the condition
$\Gamma(s,t) = u(\min(s,t))\, v(\max(s,t))$, for $s, t \in [0,1]$, entails
$u(s) = \Gamma(s,1)/v(1)$ and $v(t) = \Gamma(0,t)/u(0)$ if $u(0) > 0$. However, it is clear
that these conditions only determine $u$ and $v$ up to multiplicative constants so that
one can impose (without loss of generality) the additional assumption $v(1) = 1$. Thus,
it turns out that $u$ and $v$ can be uniquely determined in terms of $\Gamma(0,t)$ and
$\Gamma(s,1)$. Our study will require three steps: first, the estimation of the mean
function $m$ and its derivatives, then the analogous study for $\Gamma(0,t)$,
$\Gamma(s,1)$ and $\sigma^2(t) := \Gamma(t,t)$ and, finally, the analysis of more involved
functions defined in terms of these.

In Propositions 1 to 3 below we assume that the sample data $X_1, \dots, X_n$ are iid
trajectories of a process $X$ in the space $C[0, 1]$, endowed with the supremum norm,
$\|\cdot\|$.



Estimation of the mean and covariance functions and their derivatives 

To estimate the mean function $m(t) = E[X(t)]$ and its derivatives, we will only need
to assume that $X_1$ satisfies $E\|X_1\|^2 < \infty$, which (see p. 172 in Araujo and Giné,
1980) implies that the distribution of $X_1$ satisfies the Central Limit Theorem (CLT)
in $(C[0,1], \|\cdot\|)$.

The natural estimator of $m$ is the sample mean, denoted by
$\hat m_n(t) = \sum_{i=1}^n X_i(t)/n$. Since the derivatives of $m$ are also involved in
the expressions of the Radon-Nikodym derivatives obtained in Theorem 2, we will also
need to consider the estimation of $m'$ and $m''$. Our estimators will depend on a given
sequence $h_n$ of smoothing parameters. Given $t \in [h_n, 1 - h_n]$, define

$$\hat m_n'(t) := \frac{\hat m_n(t + h_n) - \hat m_n(t - h_n)}{2 h_n}, \qquad \hat m_n''(t) := \frac{\hat m_n(t + h_n) + \hat m_n(t - h_n) - 2 \hat m_n(t)}{h_n^2}.$$

For $t \in [0, h_n)$, we define

$$\hat m_n'(t) := \frac{\hat m_n(t + h_n) - \hat m_n(0)}{h_n + t}, \qquad \hat m_n''(t) := \frac{\hat m_n(t + h_n) + \hat m_n(0) - 2 \hat m_n(\gamma_n)}{\gamma_n^2},$$

where $\gamma_n = (t + h_n)/2$. The definition of $\hat m_n'$ and $\hat m_n''$ on
$(1 - h_n, 1]$ is similar. These definitions allow us to handle analogously the extreme
points and the inner ones. Thus we will not pay special attention to the extreme points
in the proofs.

There is a slight notational abuse in these definitions as, for example, $\hat m_n'(t)$
is not the derivative of $\hat m_n(t)$ but an estimator of $m'(t)$. We keep this notation
throughout the manuscript for simplicity.
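For curves observed on a uniform grid, symmetric difference quotients of the sample mean are straightforward to compute. The following Python sketch covers the interior points only; it assumes, as our own simplification, that the bandwidth is an integer multiple `r` of the grid step, and it uses the standard normalizations $2h$ and $h^2$ for the first and second differences. The function names are hypothetical.

```python
def sample_mean(curves):
    """Pointwise sample mean of curves observed on a common grid."""
    n = len(curves)
    return [sum(x[j] for x in curves) / n for j in range(len(curves[0]))]


def interior_derivatives(m, step, r):
    """Difference estimates of m' and m'' with bandwidth h = r * step.

    Returns two dicts keyed by the grid index j, for r <= j <= len(m) - 1 - r,
    that is, for grid points in [h, 1 - h].
    """
    h = r * step
    d1, d2 = {}, {}
    for j in range(r, len(m) - r):
        d1[j] = (m[j + r] - m[j - r]) / (2.0 * h)
        d2[j] = (m[j + r] + m[j - r] - 2.0 * m[j]) / h ** 2
    return d1, d2
```

For a quadratic mean function both difference quotients are exact, which gives a convenient sanity check; in general the choice of `r` trades the bias of the difference quotient against the sampling noise amplified by dividing by $h$ or $h^2$.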

As mentioned at the beginning of this section, due to the triangular structure of
$\Gamma$, in principle we should only concentrate on the estimation of the functions
$s \mapsto \Gamma(s, 1)$ and $t \mapsto \Gamma(0, t)$ and their derivatives. However, due to
technical reasons we will also need to consider the function $\sigma^2(t) = \Gamma(t,t)$
and its derivatives. Natural nonparametric estimators of these functions can be given
in terms of the empirical covariance

$$\hat\Gamma_n(s,t) := \frac{1}{n} \sum_{i=1}^n (X_i(s) - \hat m_n(s)) (X_i(t) - \hat m_n(t)), \quad s, t \in [0, 1].$$

The estimation of the required derivatives is carried out in an analogous way as we
did with the mean function. Observe finally that, since $v(1) = 1$, we can estimate
$u(t) = \Gamma(t, 1)$ by $\hat u_n(t) := \hat\Gamma_n(t, 1)$ for any $t \in [0, 1]$, and
similarly for its first two derivatives. Regarding the function $\sigma^2$, we estimate
$\sigma^2(t)$ by $\hat\sigma_n^2(t) := \hat\Gamma_n(t,t)$.
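The estimators of $u$ and $\sigma^2$ just described can be sketched as follows for curves observed on a common grid whose last point is $t = 1$. The divisor $n$ in the empirical covariance and the function names are assumptions made for this illustration.

```python
def empirical_cov(curves, j, k):
    """Empirical covariance between the values at grid points j and k (divisor n)."""
    n = len(curves)
    mean_j = sum(x[j] for x in curves) / n
    mean_k = sum(x[k] for x in curves) / n
    return sum((x[j] - mean_j) * (x[k] - mean_k) for x in curves) / n


def estimate_u_sigma2(curves):
    """Return the lists u_n(t_j) = Gamma_n(t_j, 1) and sigma2_n(t_j) = Gamma_n(t_j, t_j)."""
    last = len(curves[0]) - 1  # grid index of t = 1
    u_n = [empirical_cov(curves, j, last) for j in range(last + 1)]
    sigma2_n = [empirical_cov(curves, j, j) for j in range(last + 1)]
    return u_n, sigma2_n
```

Derivatives of these estimated functions can then be approximated by the same difference quotients used for the mean function.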

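Proposition 1 Let $X_1, \dots, X_n$ be iid trajectories in $C[0, 1]$ of a process such
that $E\|X_1\|^2 < \infty$ and whose mean function $m : [0, 1] \to \mathbb{R}$ has a Lipschitz
second derivative.

a) For the mean estimation problem we have

$$\|m - \hat m_n\| = O_P(n^{-1/2}), \tag{16}$$

$$\|m' - \hat m_n'\| = O_P\left( (n^{1/2} h_n)^{-1} \right) + O(h_n^2), \tag{17}$$

$$\|m'' - \hat m_n''\| = O_P\left( (n^{1/2} h_n^2)^{-1} \right) + O(h_n). \tag{18}$$

b) Assume that $E\|X_1\|^4 < \infty$ and that the functions $t \mapsto \Gamma(t, 1)$,
$t \mapsto \Gamma(0, t)$ and $\sigma^2$ admit Lipschitz second order derivatives. Then, we
have

$$\|\hat\Gamma_n(\cdot, 1) - \Gamma(\cdot, 1)\| = \|\hat u_n - u\| = O_P(n^{-1/2}), \tag{19}$$

$$\|\hat\Gamma_n'(\cdot, 1) - \Gamma'(\cdot, 1)\| = \|\hat u_n' - u'\| = O_P\left( (n^{1/2} h_n)^{-1} \right) + O(h_n^2), \tag{20}$$

$$\|\hat\Gamma_n''(\cdot, 1) - \Gamma''(\cdot, 1)\| = \|\hat u_n'' - u''\| = O_P\left( (n^{1/2} h_n^2)^{-1} \right) + O(h_n). \tag{21}$$

Similar results also hold for $\hat\Gamma_n(0, \cdot)$ and $\hat\sigma_n^2$.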

From the proof of this proposition (see the Appendix) it can be checked that the
assumption $E\|X_1\|^4 < \infty$ can be replaced with $E\|X_1\|^{2+\delta} < \infty$, for some
$\delta > 0$, and $E(X^r(1)) < \infty$ for any $r > 0$.

Estimation of v 

The estimation of $v$ is harder than that of $u$. It will be useful to distinguish two
cases, where the estimators must be defined in different ways. In the case $u(0) > 0$
(corresponding to the case $\sigma^2(0) > 0$) we have $v(t) = \Gamma(0,t)/u(0)$, which is
estimated by

$$\hat v_n(t) := \frac{1}{\hat u_n(0)}\, \hat\Gamma_n(0, t), \quad t \in [0, 1]. \tag{22}$$

When $u(0) = 0$ (which implies that $\sigma^2(0) = 0$), the estimator proposed in (22)
is, at best, highly unstable. This case is not unusual: see, for instance, the examples
introduced in Subsection 2.3 when $X(0)|Y = i$ is constant. For the sake of simplicity
from now on assume that $\sigma^2(t) > 0$ for $t \in (0, 1)$.

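The first step is to define $\hat v_n(t) = \hat\sigma_n^2(t)/\hat u_n(t)$ for
$t \in [\delta_n, 1]$, where $\delta_n$ is a sequence of positive numbers converging to zero
(whose rate will be determined later). Then we define estimates for the first and the
second derivatives of $v$ on the same interval. The structure of $\hat v_n$ as a quotient
suggests defining, on $[\delta_n, 1]$,

$$\hat v_n' := \frac{1}{\hat u_n^2} \left[ (\hat\sigma_n^2)'\, \hat u_n - \hat u_n'\, \hat\sigma_n^2 \right],$$

with $\hat v_n''$ defined analogously from the second derivative of the quotient, where
$(\hat\sigma_n^2)'(t) = \hat\Gamma_n'(t,t)$ and $(\hat\sigma_n^2)''(t) = \hat\Gamma_n''(t,t)$.

Now we complete the definition of our estimator of $v$ on the whole interval by using
a Taylor-kind expansion on $[0, \delta_n)$,

$$\hat v_n(t) := \hat v_n(\delta_n) + (t - \delta_n)\, \hat v_n'(\delta_n) + \frac{1}{2} (t - \delta_n)^2\, \hat v_n''(\delta_n), \quad \text{if } t \in [0, \delta_n). \tag{23}$$

Finally, take

$$\hat v_n''(t) := \hat v_n''(\delta_n), \quad \text{if } t \in [0, \delta_n).$$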
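The two-piece construction of the estimator of $v$ (a quotient away from zero, a second order Taylor-type extension near zero) can be sketched in Python as follows. The callables `v_at`, `dv_at` and `d2v_at`, standing for the quotient estimate and its two derivative estimates on $[\delta, 1]$, are hypothetical inputs assumed to be available; the function name `extend_v` is ours.

```python
def extend_v(v_at, dv_at, d2v_at, delta):
    """Extend an estimate of v from [delta, 1] to [0, 1].

    On [delta, 1] the given estimate v_at is used unchanged; on [0, delta) the
    value is obtained from a second order Taylor-type expansion around delta.
    """
    v_delta = v_at(delta)
    d1_delta = dv_at(delta)
    d2_delta = d2v_at(delta)

    def v_hat(t):
        if t >= delta:
            return v_at(t)
        shift = t - delta
        return v_delta + shift * d1_delta + 0.5 * shift ** 2 * d2_delta

    return v_hat
```

When the inputs come from an exactly quadratic function the extension reproduces that function on $[0, \delta)$, which is a simple way to test an implementation.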
Proposition 2 Let the assumptions of Proposition 1 (b) hold.

a) If $u(0) > 0$ then the rates of convergence of $\|\hat v_n - v\|$, $\|\hat v_n' - v'\|$
and $\|\hat v_n'' - v''\|$ are the same as those of (19), (20) and (21), respectively.

b) If $u(0) = 0$ assume that $\inf_t u'(t) > 0$ and $\inf_{t \in [\delta, 1]} \sigma^2(t) > 0$
for every $\delta > 0$. Let $\{\delta_n\} \downarrow 0$ be such that
$\sup(n^{-1/2}, h_n) = o(\delta_n)$. Then

\K - "11 = Op (-J^) + 0(K) + 0(dl) 



n 



Estimation of the Radon-Nikodym derivatives 

Here we plug in the estimates of $m$, $u$, $v$ and their derivatives obtained above into
the Radon-Nikodym derivatives $f = d\mu_0/d\mu_1$ given in Theorem 2. Denote by
$\hat f_n$ the resulting estimate. Then, we compute the convergence rate to the Bayes
risk of the error attained by the corresponding nonparametric plug-in classification
procedure.

According to Theorem 2 the Radon-Nikodym densities of interest are the exponential
of some integrals, ratios, products or square roots of functions estimated with the
orders of convergence appearing in Propositions 1 and 2. The final rate will be that of
the worst estimate handled, which corresponds to the second order derivatives. As with
the estimation of $v$, there is some difference in the orders depending on whether
$\sigma^2(0)$ is strictly positive or not.

The main conclusions are summarized in the following result. 

Theorem 3 Let us assume that the conditions in Proposition 1 (b) and Theorem 2 hold.

a) If $u_i(0) > 0$ for $i = 0,1$, then for $h_n = O(n^{-1/6})$ we get

$$\log \hat f_n(x) - \log f(x) = O_P(n^{-1/6}), \quad x \in C[0, 1].$$

b) If $u_i(0) = 0$ for $i = 0,1$ and $\inf_t u'(t) > 0$ and $\inf_{t\in[\delta,1]} \sigma^2(t) > 0$ for every $\delta > 0$,
then, for hn = 0{n~^^^^) we have 

E(^log/„(X)-log|^(X) X„...,Xn^=Op{n-'/''). 
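The bandwidth order in part a) can be traced back to a balance between the two error terms that the appendix gives for the second-derivative estimates: a stochastic term of order $(n^{1/2}h_n^2)^{-1}$ and a bias term of order $h_n$. The following sketch, with all constants set to 1 as a simplifying assumption and hypothetical function names, checks numerically that minimising their sum gives $h_n$ of order $n^{-1/6}$ and a bound of the same order.

```python
import numpy as np

def bound(h, n):
    """Stochastic term plus bias term, with all constants set to 1 (assumption)."""
    return 1.0 / (np.sqrt(n) * h**2) + h

def balanced_bandwidth(n):
    """Zero of the derivative of bound(., n): h^3 = 2 / sqrt(n)."""
    return (2.0 / np.sqrt(n)) ** (1.0 / 3.0)

ns = np.array([1e4, 1e6, 1e8, 1e10])
h_opt = balanced_bandwidth(ns)
# Log-log slope of the minimised bound against n.
slope = float(np.polyfit(np.log(ns), np.log(bound(h_opt, ns)), 1)[0])
# Moving away from h_opt in either direction should increase the bound.
is_minimum = bool(np.all(bound(1.1 * h_opt, ns) > bound(h_opt, ns))
                  and np.all(bound(0.9 * h_opt, ns) > bound(h_opt, ns)))
```

The fitted `slope` is the exponent of $n$ in the minimised bound under these simplifications.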
Let us note that, in any case, our nonparametric estimator $\hat f_n = dP_{\hat m_0,\hat\Gamma_0}/dP_{\hat m_1,\hat\Gamma_1}$
is constructed, using (11), under the sole assumption that the covariance function has
a triangular structure. So, the estimator is formally the same in both a) and b)
of Theorem |2l If we knew that mj = for z = 0, 1 then we could employ fn{x) = 
d-^rhotol d^rhoti ^"^^ rates of Theorem [3] would improve, under the assumptions of 
Theorem Elb), to Op{n-'^/'^^). 

Using higher order derivatives 

The proof of Theorem 3 was based on the use of Taylor expansions of order two.
Next we show how the existence of higher order derivatives improves the estimation
process.

Proposition 3 Under the assumptions of Theorem 3, suppose further that the mean
function $m : [0, 1] \to \mathbb{R}$ as well as the functions $t \mapsto \Gamma(t, 1)$, $t \mapsto \Gamma(0,t)$ and $t \mapsto \sigma^2(t)$ admit
Lipschitz third order derivatives. Then the rates in Theorem 3 a) and b) are improved
to Op{n~^^'^) and Op{n~^^^'^) , respectively. 

A remark similar to that made after Theorem 3 applies here. If we incorporate the
information $u_i(0) = 0$ into the estimator, the convergence rate in Proposition 3 b) slightly
improves to Op{n~^^^). 

The convergence orders may be further improved by assuming additional smoothness
and taking advantage of numerical differentiation techniques (see, for instance, p.
146 in Gautschi, 1997). We will not develop this idea in the present work. However, let
us observe that, when estimating infinitely differentiable functions, it is possible to
obtain orders as close to $O_P(n^{-1/2})$ as desired by choosing $k$ large enough in the $k$-point
rule (see, for instance, Herzeg and Cvetkovic, 1986).
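To illustrate why using more points helps, the sketch below compares the three-point central difference with the standard five-point central stencil on a smooth toy function. The five-point stencil is used here only as a familiar example of a higher-order rule; it is not claimed to be the specific $k$-point rule of the cited references, and the function names are hypothetical.

```python
import numpy as np

def diff3(f, t, h):
    """Three-point central difference, bias of order h^2."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def diff5(f, t, h):
    """Five-point central stencil, bias of order h^4."""
    return (-f(t + 2 * h) + 8 * f(t + h) - 8 * f(t - h) + f(t - 2 * h)) / (12.0 * h)

t0, h = 0.5, 0.05
err3 = abs(diff3(np.sin, t0, h) - np.cos(t0))   # error of the three-point rule
err5 = abs(diff5(np.sin, t0, h) - np.cos(t0))   # error of the five-point rule
```

A smaller bias for a given step allows a larger $h_n$, which in turn reduces the stochastic term of the estimated derivatives.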

Estimation of the probability of misclassification 

We denote by $L_n := L(\hat g_n) = P\{\hat g_n(X) \neq Y \mid \mathcal{X}_n\}$ the classification error associated
with the nonparametric plug-in rule $\hat g_n(x) = I_{\{\hat\eta_n(x) > 1/2\}}$. Here $\hat\eta_n$ is obtained by
substituting the Radon-Nikodym derivative $f = d\mu_0/d\mu_1$ in (5) with the estimator $\hat f_n$
obtained by replacing $m$, $u$, $v$ and their derivatives with the corresponding nonparametric
estimators obtained along this subsection. The following result is an example of how
the convergence rates for the difference between the logarithms of the Radon-Nikodym
derivatives $\hat f_n(x)$ and $f(x)$ can be translated into convergence rates of $L_n$ to the Bayes
error $L^*$.

Theorem 4 Let the assumptions of Proposition 1 (b) and Theorem 2 hold. If $u_i(0) > 0$
for $i = 0,1$, then taking $h_n = O(n^{-1/6})$ we get $L_n - L^* = O_P(n^{-1/6})$.
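A minimal sketch of how a plug-in rule turns an estimated Radon-Nikodym derivative into a decision is given below. It assumes that the regression function is $\eta(x) = P(Y = 1 \mid X = x) = (1-p)/(p f(x) + 1 - p)$ with $p = P\{Y = 0\}$ and $f = d\mu_0/d\mu_1$; this form, the function names and the simulated values are assumptions of the sketch, since expression (5) is not reproduced in this part of the text.

```python
import numpy as np

def eta_from_f(f, p=0.5):
    """Assumed form of P(Y = 1 | X = x) given f = d mu_0 / d mu_1 at x."""
    return (1.0 - p) / (p * f + (1.0 - p))

def plug_in_rule(log_f_hat, p=0.5):
    """Assign label 1 when the estimated posterior probability exceeds 1/2."""
    eta_hat = eta_from_f(np.exp(log_f_hat), p)
    return (eta_hat > 0.5).astype(int)

rng = np.random.default_rng(0)
f = rng.uniform(0.1, 5.0, size=1000)                   # hypothetical true values
f_hat = f * np.exp(rng.normal(0.0, 0.2, size=1000))    # hypothetical noisy estimates
gap_eta = np.abs(eta_from_f(f) - eta_from_f(f_hat))
gap_f = np.abs(f - f_hat)
labels = plug_in_rule(np.log(f_hat))
```

For $p = 1/2$ the gap between the two posterior probabilities never exceeds the gap between $f$ and its estimate, which is the elementary step that lets rates for $\hat f_n$ carry over to the classification error.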
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In the case when Mj(0) = 0, for i = 0, 1, we can prove that Z/„ — L* is Op(n~^/^°) 
under the assumptions that inf(u'(t) > and inf^g^^^i] cr^(t) > for every 6 > 0. The 
idea is to follow the same steps as in the proof of Theorem HI but bounding the integrals 
in fl38|) and P2l) as we did along the proof of Theorem 3. 



3 Consistency of the k-NN functional rules

As stated in the introduction, the k-NN classifier is not universally consistent in the
functional setting. However, Cerou and Guyader (2006) provide sufficient conditions
for the consistency $L_n \to L^*$ in probability (or, equivalently, $E(L_n) \to L^*$), where $L_n$
is the conditional classification error of the k-NN rule. In this section we show that
these conditions are fulfilled by the Gaussian processes introduced in Section 2.2 and,
in consequence, that the k-NN rule is consistent in probability for them.

Throughout this section the feature space where the variable $X$ takes values is a
separable metric space $(\mathcal{F}, D)$. As usual, we will denote by $P_X$ the distribution of $X$
defined by $P_X(B) = P\{X \in B\}$ for $B \in \mathcal{B}_{\mathcal{F}}$, where $\mathcal{B}_{\mathcal{F}}$ are the Borel sets of $\mathcal{F}$.

The key assumption is a regularity condition on the regression function $\eta(x) =
E(Y \mid X = x)$, which is called the Besicovich condition (BC). The function $\eta$ is said to fulfill
(BC) if

$$\lim_{\delta \to 0} \frac{1}{P_X(B_{X,\delta})} \int_{B_{X,\delta}} \eta(z)\, dP_X(z) = \eta(X) \quad \text{in probability},$$

where $B_{x,\delta} := \{z \in \mathcal{F} : D(x,z) \le \delta\}$ is the closed ball with center $x$ and radius $\delta$.
The Besicovich condition plays, for instance, an important role in the consistency of kernel
rules (see Abraham et al. 2006).

Cerou and Guyader (2006, Th. 2) have proved that, if $(\mathcal{F}, D)$ is separable and
condition (BC) is fulfilled, then the k-NN classifier defined by (2) and (3) is consistent
in probability provided that $k_n \to \infty$ and $k_n/n \to 0$. In order to apply this result in our
case, it will be sufficient to observe that the continuity ($P_X$-a.e.) of $\eta(x)$ implies also
(BC). Consequently we can establish the following result, whose proof is immediate
from Theorems 1 and 2.

Proposition 4 Under the assumptions of Theorem 1, suppose that $P_X(\partial S) = 0$. Then
for $P_X$-a.e. $x, z$ in the topological interior of $S$,



\ri{z) -ri{x)\ 



p L — p 



p 



dfii dj^i 



■ (24) 



As a consequence, for both cases a) and b) considered in Theorem 2, the k-NN functional
classifier is consistent in probability, provided that $k_n \to \infty$ and $k_n/n \to 0$.

Of course, the point is that the Radon-Nikodym derivatives given in Theorem 2 are
continuous on $C[0, 1]$. So (24) would imply also the continuity of $\eta(x)$, which in turn
entails the Besicovich condition (BC) and the consistency.



4 Empirical results 

In this section we compare the performance of the k-NN classification procedure with
the plug-in one for infinite-dimensional data. First (Subsection 4.1) we describe the
results of a simulation study carried out with processes from the two Gaussian families
specified in Subsection 2.3. Afterwards (Subsection 4.2) we focus on a real data set.

4.1 Monte Carlo study

The observations will be realizations of two Ornstein-Uhlenbeck processes and two
Brownian motions as described in Subsection 2.3. The parameters chosen for the pairs of
processes are specified in Table 1 (in Figure 1 we have depicted some trajectories of the
processes used in the simulations).

We assume that $p = P\{Y = 0\}$, the proportion of observations coming from $P_0$,
is 1/2 and is known in advance. For each $i = 0, 1$ we take a training sample of
size $n_i = 100$ and a test sample of size 50 from $P_i$. The processes are observed at
equidistant times of the interval $[0, 1]$, $t_0 = 0, t_1, \ldots, t_N = 1$, with $N = 50$. We denote
by $\Delta = t_j - t_{j-1}$ the internodal distance. The number of Monte Carlo runs is 1000.
In each run we use the training sample to construct four classifiers: k-NN with the
supremum norm and with a PLS-based semimetric (see e.g. Ferraty and Vieu, 2006,
p. 30), and the parametric and nonparametric plug-in rules introduced in Subsections 2.3 and
2.4 respectively. The performance of these classifiers is assessed by the proportion of
correctly classified observations in the test samples. We also compute this proportion for
the Bayes rule associated with each model. The number $k$ of neighbours and the number
of PLS directions for projection are chosen via cross-validation from a maximum of 10
neighbours and 5 PLS directions respectively.
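One run of such an experiment, restricted to the k-NN rule with the supremum distance, could look like the sketch below. It assumes that the two populations are $\sigma W(t)$ and $ct + \sigma W(t)$ with $W$ a standard Brownian motion started at 0, uses a fixed $k = 5$ instead of cross-validation, and all function names are hypothetical.

```python
import numpy as np

def simulate(n, c, sigma, N, rng):
    """n discretised paths of c*t + sigma*W(t) on t_0 = 0, ..., t_N = 1 (assumed model)."""
    t = np.linspace(0.0, 1.0, N + 1)
    steps = rng.normal(0.0, sigma * np.sqrt(1.0 / N), size=(n, N))
    w = np.concatenate([np.zeros((n, 1)), np.cumsum(steps, axis=1)], axis=1)
    return c * t + w

def knn_sup(train_x, train_y, test_x, k):
    """k-NN rule using the supremum distance between discretised curves."""
    dist = np.max(np.abs(test_x[:, None, :] - train_x[None, :, :]), axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]      # indices of the k closest curves
    votes = train_y[nearest].mean(axis=1)          # fraction of neighbours with label 1
    return (votes > 0.5).astype(int)

rng = np.random.default_rng(1)
N, n_train, n_test = 50, 100, 50
train_x = np.vstack([simulate(n_train, 0.0, 1.0, N, rng),
                     simulate(n_train, 3.0, 1.0, N, rng)])
train_y = np.repeat([0, 1], n_train)
test_x = np.vstack([simulate(n_test, 0.0, 1.0, N, rng),
                    simulate(n_test, 3.0, 1.0, N, rng)])
test_y = np.repeat([0, 1], n_test)
accuracy = float(np.mean(knn_sup(train_x, train_y, test_x, k=5) == test_y))
```

Repeating this over many seeds and averaging `accuracy` gives the kind of summary reported for each classifier.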

When applying the nonparametric plug-in method to the data functions evaluated
on the whole interval $[0, 1]$ we observed a noticeable boundary effect near 0, especially
in the estimation of $v$ and its derivatives. This made the nonparametric plug-in method
perform poorly. In order to avoid this, the Radon-Nikodym derivative for the nonparametric
plug-in rule has been evaluated on the trajectories restricted to the interval $[h_n, 1]$,
where $h_n$ is the same (and unique) smoothing parameter used in the estimation of the
derivatives of $u_i$ and $v_i$. The value of $h_n$ has been chosen among $\{2\Delta, 4\Delta, \ldots, 20\Delta\}$ via
cross-validation: for each $h_n = k\Delta$ we compute the corresponding estimated classification
error with the usual leave-one-out device (every training observation is classified, as
if it were a new incoming observation, using the remaining data as a training sample).
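The leave-one-out selection of $h_n$ can be sketched generically as follows. The classifier `toy_rule` is a deliberately simple stand-in (it compares curve means over $[h, 1]$) and is not the nonparametric plug-in rule of the text; the function names and the toy data are hypothetical.

```python
import numpy as np

def loo_error(classify, x, y, h):
    """Leave-one-out misclassification rate of classify(train_x, train_y, new_x, h)."""
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i                   # drop observation i from the training set
        pred = classify(x[keep], y[keep], x[i:i + 1], h)
        wrong += int(pred[0] != y[i])
    return wrong / n

def select_h(classify, x, y, candidates):
    """Candidate with the smallest leave-one-out error (first one in case of ties)."""
    errors = [loo_error(classify, x, y, h) for h in candidates]
    return candidates[int(np.argmin(errors))], errors

def toy_rule(train_x, train_y, new_x, h):
    """Stand-in classifier: threshold the mean of each curve over [h, 1]."""
    start = int(round(h * (train_x.shape[1] - 1)))
    m0 = train_x[train_y == 0][:, start:].mean()
    m1 = train_x[train_y == 1][:, start:].mean()
    return (new_x[:, start:].mean(axis=1) > 0.5 * (m0 + m1)).astype(int)

N = 50
grid_step = 1.0 / N                                # the internodal distance
candidates = [2 * k * grid_step for k in range(1, 11)]
x = np.vstack([np.zeros((5, N + 1)), np.ones((5, N + 1))])   # toy, perfectly separated data
y = np.repeat([0, 1], 5)
best_h, errors = select_h(toy_rule, x, y, candidates)
```

Any classifier with the same call signature can be passed in place of `toy_rule`.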

In Table 1 we display the mean and the standard deviation (in parentheses)
of the proportion of correct classifications over the 1000 Monte Carlo samples. We see
that the parametric plug-in procedure is the one performing best: it is very near the
optimum.

As could be expected, the nonparametric plug-in behaves worse than the parametric
one. Its best performance corresponds to the random start cases $u_i(0) > 0$ for
$i = 0,1$. In these situations, it is the second best classifier. When $u_i(0) = 0$, the
parametric plug-in is still the winner, the k-NN with PLS is the second and the k-NN
with the supremum metric and the nonparametric plug-in perform similarly.

It is interesting to note that the k-NN classification method is always reliable (even
with the supremum metric, although the PLS semimetric yields better results). Thus one
of the conclusions of the study is that, when classifying functional data, the k-NN
procedure is generally a safe choice, free of model assumptions.



Table 1 here.



4.2 A real data set 

We compare the performance of the k-NN classification procedure with the nonparametric
plug-in one in the analysis of data from research in experimental cardiology.
The experiment was conducted at the Vall d'Hebron Hospital (Barcelona, Spain). See
Ruiz-Meana et al. (2003) for biochemical and medical details on the data and Cuevas,
Febrero and Fraiman (2004, 2006) for previous analyses of these observations.

The variable under study is the mitochondrial calcium overload (MCO), which measures
the level of the mitochondrial calcium ion (Ca2+). This variable was observed
every 10 seconds during an hour in isolated mouse cardiac cells. The aim of the study
was to assess whether a drug called Cariporide increased the MCO level. The data we
analyze here consist of two samples of functions with sizes $n_0 = 45$ (control group) and
$n_1 = 44$ (treatment group with Cariporide). In Figure 2 we display (a) all the data and
(b) the group means.

In many cases, during the first three minutes each curve shows oscillations which correspond
to normal contractions of the cells. This first part of the curves has been eliminated (as
in the original experiments with these data) because it has high variability and depends
on uncontrolled factors.

To bring the distributions closer to normality, we have considered a
transformation of the data, $X = \log(\mathrm{MCO} - 85)$. The performance of each of the
classification procedures considered is described by the probability of correctly classifying
one of the transformed observations, approximated via cross-validation.



Figure 2 here.

Obviously, in this case, we do not have enough information to consider using the
parametric plug-in classifier. Consequently we only employ the k-NN (with uniform
metric and PLS-based semimetric) and the nonparametric plug-in discrimination rules.
The results appear in Table 2. It is interesting to notice that the results in this case are,
in some sense, the opposite of those obtained with the simulations. The nonparametric
plug-in clearly outperforms the other two and the k-NN with the supremum metric does
better than the k-NN with PLS.

Acknowledgement. The authors want to thank Javier Segura for bringing to our
attention some numerical differentiation techniques (in particular, the k-point rule).



5 Appendix 

A.1 Parameter estimation for the models of Subsection 2.3

Two Brownian motions 

In the simulations of Section H] the estimator of c is c = argmin^ Ylj^=ii^oi^j) ~ctj)^, 
where rrii is the sample mean of the observations coming from Pj. The parameters 
6i and cr^ are respectively estimated by 6i = ~ ''^j(O))^ /(^j ~ 1) ^^id 

= E.=o,i E;=i (^.(1; - ^i(i) - ^.(0; ^) + mio))' /K + n,-i). 

Two Ornstein-Uhlenbeck processes

The estimation of the unknown parameters ($\beta_i$, $\eta_i$ and $\sigma_i$, $i = 0,1$) is carried out
via linear least-squares regression between the realizations of the process at consecutive
time points. The main idea is that, for $i = 0,1$ and for any $0 \le s < t \le 1$, we have

$$X(t; i) = X(s; i)\, e^{-\beta_i (t-s)} + \eta_i \left(1 - e^{-\beta_i (t-s)}\right) + \sigma_i \sqrt{1 - e^{-2\beta_i (t-s)}}\; Z, \tag{25}$$

where $Z$ is $N(0, 1)$. The updating formula (25) is valid when $X(0; i)$ is either deterministic
or random. In particular, for $i = 0, 1$, $k = 1, \ldots, n_i$ and $j = 0, \ldots, N - 1$,

$$X_k(t_{j+1}; i) = a_i X_k(t_j; i) + b_i + \sigma_i \sqrt{1 - e^{-2\beta_i \Delta}}\; Z_{kj}, \tag{26}$$

where $a_i := e^{-\beta_i \Delta}$, $b_i := \eta_i (1 - e^{-\beta_i \Delta})$ and $Z_{kj}$ are i.i.d. $N(0, 1)$ variables.

Observe that, by estimating the parameters of the simple linear regression equation
(26), we can construct estimators of $\beta_i$, $\eta_i$ and $\sigma_i$. When $X(0;i)$ is deterministic, we
compute the least-squares estimators of $a_i$ and $b_i$, that is, the values $\hat a_i$ and $\hat b_i$ minimizing
$\sum_{k}\sum_{j} \hat u_{kj}^2$, where $\hat u_{kj} := X_k(t_{j+1}; i) - (\hat a_i X_k(t_j; i) + \hat b_i)$ are the residuals. Then

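$$\hat\beta_i = -\frac{\log(\hat a_i)}{\Delta}, \qquad \hat\eta_i = \frac{\hat b_i}{1 - \hat a_i}, \qquad \hat\sigma_i^2 = \frac{1}{n_i N (1 - \hat a_i^2)} \sum_{k=1}^{n_i}\sum_{j=0}^{N-1} \hat u_{kj}^2. \tag{27}$$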






When $X(0;i)$ is random, we can compute $\hat\beta_i$ and $\hat\sigma_i^2$ as in (27), but $\eta_i$ is better
estimated by $\hat\eta_i = \sum_{j=1}^{n_i} \sum_{k=0}^{N} X_j(t_k; i)/(n_i (N + 1))$.
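The regression-based estimation for one Ornstein-Uhlenbeck population with a deterministic start can be sketched as follows. The simulation uses the updating formula with unit-variance innovations scaled by $\sigma\sqrt{1 - a^2}$, where $a = e^{-\beta\Delta}$. The normalisation of the residual variance by the number of residuals, the function names and the parameter values are assumptions of this sketch.

```python
import numpy as np

def simulate_ou(n, N, beta, eta, sigma, x0, rng):
    """n paths on t_j = j/N generated with the exact one-step updating formula."""
    a = np.exp(-beta / N)
    x = np.empty((n, N + 1))
    x[:, 0] = x0
    for j in range(N):
        z = rng.normal(size=n)
        x[:, j + 1] = a * x[:, j] + eta * (1.0 - a) + sigma * np.sqrt(1.0 - a**2) * z
    return x

def estimate_ou(x):
    """Least-squares fit of X(t_{j+1}) on X(t_j), mapped back to (beta, eta, sigma)."""
    delta = 1.0 / (x.shape[1] - 1)
    prev, nxt = x[:, :-1].ravel(), x[:, 1:].ravel()
    a_hat, b_hat = np.polyfit(prev, nxt, 1)        # slope and intercept
    resid = nxt - (a_hat * prev + b_hat)
    beta_hat = -np.log(a_hat) / delta
    eta_hat = b_hat / (1.0 - a_hat)
    # Dividing by the number of residuals is an assumption of this sketch.
    sigma_hat = np.sqrt(np.mean(resid**2) / (1.0 - a_hat**2))
    return beta_hat, eta_hat, sigma_hat

rng = np.random.default_rng(2)
paths = simulate_ou(n=2000, N=50, beta=1.0, eta=1.0, sigma=1.0, x0=0.0, rng=rng)
beta_hat, eta_hat, sigma_hat = estimate_ou(paths)
```

Because $a$ is close to 1 when $\Delta$ is small, small errors in the fitted slope are amplified in `beta_hat` and `eta_hat`, which is why the sketch uses many paths.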



A.2 Proofs of the results in Subsection 2.4

Proof of Proposition 1

(a) By the functional CLT in $(C[0, 1], \|\cdot\|)$ (see p. 172 in Araujo and Gine, 1980) the
sequence $\sqrt{n}(\hat m_n - m)$ converges weakly. This entails that the sequence $\|\sqrt{n}(\hat m_n - m)\|$
is bounded in probability, which in turn implies (16). Concerning (17) and (18), let us
denote $X_i^*(t) = X_i(t) - m(t)$, $t \in [0, 1]$, $i = 1, 2, \ldots$. Note that, for $t \in [h_n, 1 - h_n]$,

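$$\begin{aligned}
|m'(t) - \hat m_n'(t)| &\le \left| m'(t) - \frac{m(t + h_n) - m(t - h_n)}{2h_n} \right| + \left| \frac{1}{2h_n n}\sum_{i=1}^n X_i^*(t + h_n) - \frac{1}{2h_n n}\sum_{i=1}^n X_i^*(t - h_n) \right| \\
&\le \left| m'(t) - \frac{m(t + h_n) - m(t - h_n)}{2h_n} \right| + \frac{1}{h_n n} \left\| \sum_{i=1}^n X_i^* \right\|.
\end{aligned} \tag{28}$$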



The CLT applied to the sequence $\{X_i^*\}$ allows us to conclude that the second term in
the right-hand side of (28) is $O_P((n^{1/2}h_n)^{-1})$. A second order Taylor expansion of the
first term implies that there exist $\psi_n^{(1)} \in (t - h_n, t)$ and $\psi_n^{(2)} \in (t, t + h_n)$ such that

$$\left| m'(t) - \frac{m(t + h_n) - m(t - h_n)}{2h_n} \right| = \frac{h_n}{4}\left| m''(\psi_n^{(1)}) - m''(\psi_n^{(2)}) \right| \le \frac{L h_n^2}{2} = O(h_n^2),$$

where $L$ is the Lipschitz constant associated with $m''$.
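The order of the bias term can be checked numerically on a toy mean function. The sketch below takes $m = \sin$ as an assumed example, for which $m'' = -\sin$ has Lipschitz constant 1, and compares the largest central-difference error on a grid with the bound $L h^2/2$; the function name is hypothetical.

```python
import numpy as np

def central_diff(f, t, h):
    """Symmetric difference quotient with step h."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

t = np.linspace(0.1, 0.9, 81)
L = 1.0                                   # Lipschitz constant of m'' for the toy m = sin
hs = np.array([0.1, 0.05, 0.025])
bias = np.array([np.max(np.abs(central_diff(np.sin, t, h) - np.cos(t))) for h in hs])
bounds = L * hs**2 / 2.0
ratio = float(bias[0] / bias[1])          # effect of halving the step
```

Halving the step divides the bias by about four, in line with the quadratic order.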

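Applying a similar reasoning to (18), we obtain that, if $t \in [h_n, 1 - h_n]$, then

$$|m''(t) - \hat m_n''(t)| \le \left| m''(t) - \frac{m(t + h_n) + m(t - h_n) - 2m(t)}{h_n^2} \right| + \frac{4}{h_n^2} \left\| \frac{1}{n}\sum_{i=1}^n X_i^* \right\|. \tag{29}$$

The CLT implies that the order of the second term in (29) is $O_P((n^{1/2}h_n^2)^{-1})$. A
second order Taylor expansion at $t$ again gives that

$$\left| m''(t) - \frac{m(t + h_n) + m(t - h_n) - 2m(t)}{h_n^2} \right| \le L h_n.$$

(b) Since

$$\begin{aligned}
\hat\Gamma_n(t,1) - \Gamma(t,1) &= \frac1n \sum_{i=1}^n \big(X_i^*(t) + m(t) - \hat m_n(t)\big)\big(X_i^*(1) + m(1) - \hat m_n(1)\big) - \Gamma(t, 1) \\
&= \frac1n \sum_{i=1}^n \big(X_i^*(t)X_i^*(1) - \Gamma(t, 1)\big) + (m(t) - \hat m_n(t))\,\frac1n \sum_{i=1}^n X_i^*(1) \\
&\quad + (m(1) - \hat m_n(1))\,\frac1n \sum_{i=1}^n X_i^*(t) + (m(t) - \hat m_n(t))(m(1) - \hat m_n(1)),
\end{aligned}$$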
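then

$$\begin{aligned}
\|\hat\Gamma_n(\cdot,1) - \Gamma(\cdot,1)\| &\le \left\| \frac1n \sum_{i=1}^n \big(X_i^* X_i^*(1) - \Gamma(\cdot,1)\big) \right\| + \|m - \hat m_n\| \left| \frac1n \sum_{i=1}^n X_i^*(1) \right| \\
&\quad + |m(1) - \hat m_n(1)| \left\| \frac1n \sum_{i=1}^n X_i^* \right\| + \|m - \hat m_n\|\,|m(1) - \hat m_n(1)| \\
&=: T_n^{(1)} + T_n^{(2)} + T_n^{(3)} + T_n^{(4)}.
\end{aligned}$$

The assumption $E\|X_1\|^4 < \infty$ implies $E\|X_1^* X_1^*(1)\|^2 < \infty$ and thus the sequence
$\{X_i^* X_i^*(1)\}$ satisfies the CLT in the supremum norm. Then, since $E[X_i^* X_i^*(1)] = \Gamma(\cdot, 1)$,
we have that $T_n^{(1)} = O_P(n^{-1/2})$. Also $T_n^{(2)} = O_P(n^{-1})$ because the CLT (real case) implies
that $\left|\frac1n \sum_{i=1}^n X_i^*(1)\right| = O_P(n^{-1/2})$ and, according to Proposition 1 (a), $\|m - \hat m_n\| =
O_P(n^{-1/2})$.

The CLT applied to $\{X_i^*\}$ and Proposition 1 (a) yield that $T_n^{(3)}$ and $T_n^{(4)}$ are $O_P(n^{-1})$.
This allows us to conclude (19). The derivatives of $\Gamma(\cdot, 1)$ are handled as those of $m$.
The estimators of $\Gamma(0, \cdot)$ and $\sigma^2(\cdot)$ behave analogously to that of $\Gamma(\cdot, 1)$. □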

Proof of Proposition 2

a) According to expression (22) for $\hat v_n(t)$, this estimator is a quotient of two convergent
sequences. Since the denominator, $\hat u_n(0)$, converges to $u(0) > 0$, an upper bound
for the overall rate of the quotient is the slowest of the rates of $\hat\Gamma_n(0, t)$ and $\hat u_n(0)$. Similar
arguments apply for the first and second derivatives.

b) Let $t \in [\delta_n, 1]$. The hypothesis on $u'$ implies that $\inf_{t \ge \delta_n} u(t) \ge O(\delta_n)$. Since
$n^{-1/2} = o(\delta_n)$, from (19) we obtain that $\inf_{t \ge \delta_n} \hat u_n(t) \ge O_P(\delta_n)$. Therefore, a direct
calculation based on the expression of $\hat v_n$ together with Proposition 1 b) leads to



sup \Vn{t)-v{t)\ 
tG[5„,l] 



1 



(30) 



The same reasoning, taking into account the relative orders between 6n and /i„ leads to 

o.(^).om (31) 

Op{^^;r^]+0{-^]. (32) 



sup \ii{t)-v'{t)\ 

iG[<5„,l] 



sup \m-v"{t)\ 

te[5„,i] 



52 



Now, let $t \in [0,\delta_n]$. Using the second-order Taylor expansion of $v$ at $\delta_n$, together
with the definition (23) of $\hat v_n$, we obtain that there exists $\psi_n \in (t, \delta_n)$ such that

$$|\hat v_n(t) - v(t)| \le |\hat v_n(\delta_n) - v(\delta_n)| + (\delta_n - t)\,|\hat v_n'(\delta_n) - v'(\delta_n)| + \tfrac12 (\delta_n - t)^2\,|\hat v_n''(\delta_n) - v''(\psi_n)|,$$

and the three terms are bounded by applying (30), (31) and (32) and the fact that $v''$ is Lipschitz. Then the
first statement in Proposition 2 b) is deduced from here and (30). The remaining two
statements are proved similarly. □

Next we state a technical lemma which will be employed to prove Theorem 3.

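Lemma 1 Let $\{Y(t), t \in [0, 1]\}$ be a stochastic process whose mean function $m(t)$ and
variance function $\sigma^2(t)$ satisfy $m(0) = \sigma(0) = 0$ and both have a bounded derivative.
Let $\{\delta_n\}$ be positive numbers which converge to zero. Then

$$E\int_0^{\delta_n} |Y(t)|\,dt = O(\delta_n^{3/2}) \quad \text{and} \quad E\int_0^{\delta_n} Y^2(t)\,dt = O(\delta_n^2).$$

Proof: Let $H$ be a common upper bound for the derivatives of $m$ and $\sigma^2$. Then, for $n$ large enough that $H\delta_n \le 1$,

$$E\int_0^{\delta_n} |Y(t)|\,dt \le \int_0^{\delta_n} E^{1/2}(Y^2(t))\,dt = \int_0^{\delta_n} \big(m(t)^2 + \sigma^2(t)\big)^{1/2}\,dt \le (2H)^{1/2} \int_0^{\delta_n} t^{1/2}\,dt = O(\delta_n^{3/2}).$$

The second statement in the lemma follows analogously. □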
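As a sanity check of the first bound, consider standard Brownian motion as an assumed example: there $m = 0$, $\sigma^2(t) = t$ (so $H = 1$ works) and $E|W(t)| = \sqrt{2t/\pi}$. The sketch below integrates this expectation numerically and compares it with the bound $(2H)^{1/2}\int_0^{\delta} t^{1/2}\,dt$ from the proof; the function name is hypothetical.

```python
import numpy as np

def expected_abs_integral(delta, num=20001):
    """E int_0^delta |W(t)| dt for standard Brownian motion, by the trapezoidal rule."""
    t = np.linspace(0.0, delta, num)
    y = np.sqrt(2.0 * t / np.pi)                   # E|W(t)| for W(t) ~ N(0, t)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

H = 1.0                                            # bound on the derivatives of m and sigma^2
deltas = np.array([0.2, 0.1, 0.05])
lhs = np.array([expected_abs_integral(d) for d in deltas])
rhs = np.sqrt(2.0 * H) * (2.0 / 3.0) * deltas**1.5            # bound from the proof
closed_form = (2.0 / 3.0) * np.sqrt(2.0 / np.pi) * deltas**1.5  # exact value of the integral
```

Both sides scale as $\delta^{3/2}$, so the bound is sharp in order for this example.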

Proof of Theorem 3. From expressions (8) and (9) we see that $f = d\mu_0/d\mu_1$ is a
function of $m_i$, $u_i$, $v_i$ and their derivatives. Statement a) corresponds to the simplest
case in which $u_i(0) > 0$. In this situation, the simple structure of the estimators shows
that an upper bound for the convergence rate for $\log \hat f_n(x)$ is the worst rate for the
estimators involved in its definition, namely that of the estimators $\hat v_0''$ and $\hat v_1''$.

Hence, we concentrate on part b). For simplicity we will omit the sub-index in $v_i$ for
the rest of the proof. First notice that in the expressions for $d\mu_0/d\mu_1$ which we obtained
in Theorem 2 the second derivatives of $v$ only appear inside integrals. In other words,
we only need to care about differences of the type

$$\int_0^1 X^r(t)\big(\hat k_n(t)\hat v_n''(t) - k(t)v''(t)\big)\,dt = O_P\left(\int_0^1 X^r(t)k(t)\big(\hat v_n''(t) - v''(t)\big)\,dt\right), \tag{33}$$

for $r = 1, 2$. Here $k$ is a function depending on $u$, $v$, $u'$, $v'$, $m$ and $m'$ and $X$ is a mixture
of the Brownian motions under consideration. Let us analyze the case in Theorem 2 b)
for which r = 1 and the function k can be expressed as k = ki/ {v{{vu' — uv')"^), where 
ki is a function which can be written in terms of u, v, u', v', m and m'. Therefore, the 
assumptions in Theorem [2l imply that k is bounded. Let K be an upper bound of k. 

We split in two the integral in the right-hand side of (33), over the intervals $[0, \delta_n]$
and $[\delta_n, 1]$. Now, from (32) in the proof of Proposition 2 we have that



E 



Xit)k{tm{t)-v"it))dt\ 
1 



< O 



'p 



O 



Xl, . . . , Xn 



1 \ 1/2 

2/ 



¥.{X\t))dt 



(34) 



With respect to the other integral, we have that 



E 



x{t)k{t){v:{t)-v"{t))dt\ 



< K \\vl - v"\\ E 



\X{t)\dt = Op 



Xi, . . . , x„ 

xl/2 



+ 



n 



rl/2 
On 



0(5^), 



where the last equality comes from Lemma [Hand Proposition [2] b). Equations 
f[5S]) give 



(35) 
and 



E(^| X{t)k{t){v:{t)-v"{t))dt\ 
Taking hn = ^n"^ and (5„ = n~^/^^ 



Xi,...,xA<o 



-o(|)+o(.r). 



equates the three terms and yields the result. 



□ 



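Proof of Proposition 3. It follows the same steps as the proof of Proposition 1, the
only difference being that if we apply a third order Taylor expansion in (29), we obtain

$$\left| m''(t) - \frac{m(t + h_n) + m(t - h_n) - 2m(t)}{h_n^2} \right| = \frac{h_n}{3!}\left| m'''(\psi_n^{(1)}) - m'''(\psi_n^{(2)}) \right| \le \frac{L h_n^2}{3},$$

where $L$ now denotes the Lipschitz constant of $m'''$, and the result follows. □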



Proof of Theorem 4. Let us use the following inequality (see, e.g., Devroye et al.,
1996, p. 93)

$$L_n - L^* \le 2E\big(|\eta(X) - \hat\eta_n(X)| \,\big|\, \mathcal{X}_n\big),$$

where $\eta$ is given in (5) and $\hat\eta_n$ is obtained substituting $f = d\mu_0/d\mu_1$ by $\hat f_n$ in (5). Without
loss of generality in this proof we consider $p = P\{Y = 0\} = 1/2$.

Observe that $f$ and $\hat f_n$ are always positive since they are Radon-Nikodym derivatives
of one probability measure with respect to another. Thus, for any $x$, we have

$$|\eta(x) - \hat\eta_n(x)| = \frac{|f(x) - \hat f_n(x)|}{(1 + f(x))(1 + \hat f_n(x))} \le |f(x) - \hat f_n(x)|, \tag{36}$$

which implies that

$$L_n - L^* \le 2E\big(|f(X) - \hat f_n(X)| \,\big|\, \mathcal{X}_n\big).$$

We obtain convergence rates (in probability) for the conditional expectation on the right
of (36). Since all the cases are similar, let us consider the simple situation in which
$m_0 \neq m_1$ and $\Gamma_0 = \Gamma_1 = \Gamma$ with $\Gamma(s,t) = u(\min(s, t))\, v(\max(s, t))$. Then



f-fn 



dR 



mo,r '^-^mo.fo dP, 



mo,r ^^rho,to , '^-^mo,fo 



dPrm r dP~ p dPrm T dP~ p dP- p 

''^1;^ mi,ii '"'1?-'- mi,io ^1)^-0 



dP. 



(37) 



mi,ri 



By Theorem [2] (b) and the mean value theorem we have that, for any x, 

dP„ 



dP, 



dP - 

"'^^^'x)-^^{x) = e^{zi-z,), 



dP^ 



7711 ,r mi,ro 
where (using the notation of Theorem [2]) 



Z2 



+ ( - 2 ^) .(0) + 2 ^ .(1) - 2 m G'(t) 
^i;0 + I ^2;o - 2 ^1 x(0) + 2 ^ x(l) - 2 ^ G',{t) dt 



t'o(O) 



^o(l) 



^'o(i) 



and $z = \lambda z_1 + (1 - \lambda)z_2$ for some $\lambda \in [0, 1]$. The subscripts in the expression of $z_2$
mean that the estimation is carried out only with the sample from $P_0$.
Consequently,



E 



.dP, 



mo,r 



m.i,r 



{X) 



dP. 



{X)\ 



rfiiTo 



\Di-D 



1:0 



\Do-D 



2:0 



+2 



G(l) Go(l) 



ix(i)i+2 r\x{t)\ 

Jo 



G(0) G(0) 



^0) MO) 



G'{t) G',{t) 



v{l) Oo(l) 

<kI\Di- A;o| E (e^ll^ll + ( 1/^2 - ^2;o| + 2 max 



1^(0)1 



v{t) voit) 

Git) Go{t) 



vit) Voit) 



G'it) G'.it) 



v{t) Voit) 



E(||X||e^ll^ll|A'„) 



(38) 
(39) 
(40) 
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where k = expd^il + |-Di;o|) and 

A = max I ID2I + |-D2;o| 



Using Propositions [T] and [2] we obtain that the conditional expectations appearing in 
fl5^ and fHU]) are bounded in probabihty. Then 



G 

— H 


Gq 




G' 
— H 




V 




1 


V 


^^0 



E 



dPrr, f 



+ O P max 

t=0,l 



mi To 

G(t) Go{t) 



V{t) Vo{t) 



Of 



Op ( max \ Dj — D~.r)\ 
G'it) G'oit) 



v'it) v',{t) 



dt 



To find the convergence rates to 0 of these last three terms we use the expressions
of $D_1$, $D_2$ and $G$ appearing in Theorem 2. Some straightforward computations yield

|Di - D^,,\ = Op{\\v', - v'W), \D2 - ^2;o| = Op(||i;o - v\\), 

• 1 

and 



max 



G{t) Go{t) 



v{t) Vo{t) 
Thus we get 

E 



Op{H-v'\ 



G'{t) G',{t) 



v'it) v',{t) 



dt = Op(\\v' 



dPmi.V 



dP^ V 

mi,ro 



Op{\W'-v"\ 



(41) 



Let us now focus on the last term of (37). The analysis is similar to the one carried
out above. On the one hand, for any $x$ it holds that



dP, 



mo, To 



dP. 



2B\\x\\ 



mi, To 

where B = max{\D2-o\, ||G'o/'0o||, ||<5o/'0o||). On the other hand, for any x it also holds 
that 



dP. 



mi, To 



dP^ 



X 



mi,ri 

< \Gi-Ci\ 



Gie' 



141x^(0) + 1(721x2(1: 



l|a;r l|^||2 



< \Gi-Ci\ + CiKe^\\' 
where A = {\C-i\ + \C2\ + \F'\/ [vqVi)) /2. Consequently 



E 



dP. 



mo. To 



dP. 



{X)\l 



dP. 



mi,Vo 



mi, To 



^^^i,fi 



{X)\ 



< K { |Ci - C,\ E (e^^ll^ll I + A E ( ||Xf e^^ll^ll+^H^"' | X^) } . (43) 
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The conditional expectations in (143|) and Ci are Op(l). The term A is Op ( 
v"\\). The difference \Ci — Ci\ is Op(maxj=o,i \\vj — v\\). Thus the term in ( H3l) is 
Op(maxj=o,i IK'j ~^'"||)- This, together with ( HTl) and Proposition |2] (a), yield the desired 
result. □ 
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k-NN fc-NN Nonpar. Param. Bayes 
oo PLS plug-in plug-in rule 


Two 
Brownian 
motions 


Deterministic 

at t = 
(^0 = ^1 = 0) 


c = 1.5, a = 1 


0.68 0.73 0.71 0.77 0.77 
(0.07) (0.07) (0.16) (0.06) (0.06) 


c = 3, a = 1 


0.90 0.91 0.86 0.93 0.93 
(0.05) (0.05) (0.16) (0.04) (0.03) 


c = 2, a = 2 


0.60 0.64 0.64 0.69 0.69 
(O.Ooj (^O.Ooj (^0.16j (0.0/j (^0.06j 


Random 
at t = 
(^0,^1^0) 


C = 1.5, (7=1 

/I /I 1 

00 = 01 = 1 


0.67 0.66 0.71 0.77 0.77 

{ r\ r\'-7\ { r\ r\o\ f r\ r\o\ f r\ n'7\ f r\ r\ci\ 

(0.07) (0.08) (0.08) (0.07) (0.06) 


c = 1.5, a = 1 

o o n 
Uq = Ui = U.O 


0.67 0.70 0.72 0.77 0.77 
(0.0/j (O.Ooj (O.Ooj (0.06j (0.06j 


Two 
Ornstein- 
Uhlenbeck 

processes 


Deterministic 
at t = 


l3o = 1, ?7o = 0, ctq = 1 
/3i = 1, ^1 = 1 


0.54 0.58 0.60 0.63 0.62 

(^O.Ooj (^O.Ooj (,0.14j [\).\Jl) [u.Olj 


/3o = 0.4, r/o = 0, (To = 0.4 
A = 1, m = 1 


0.83 0.86 0.82 0.88 0.88 
(0.09) (0.06) (0.16) (0.05) (0.05) 


Random 
at t = 


/3o = 0.5, ?7o = 0, o-Q = 1 
/3i = 1, r/i = 0.5 


0.59 0.60 0.63 0.63 0.64 
(0.13) (0.11) (0.14) (0.07) (0.14) 


/3o = 0.5, r/o = 0, (To = 2 
A = 1, ^1 = 2 


0.69 0.72 0.74 0.74 0.74 
(0.11) (0.10) (0.11) (0.07) (0.09) 



Table 1: Results of the Monte Carlo study 



k-NN ‖·‖∞    k-NN PLS    Nonpar. plug-in
0.79         0.66        0.85

Table 2: Proportion of correctly classified observations for the transformed cell data.
[Figure 1: panels (a) to (d); horizontal axis: t]

Figure 1: Some trajectories ($P_0$ in gray and $P_1$ in dotted black) of the processes used in
the Monte Carlo study. In (a) and (b) we have two Brownian motions and in (c) and
(d) the processes are Ornstein-Uhlenbeck. In (a) and (c) $X(0)\,|\,Y = i$ is 0 and in (b) and
(d) it is random.
[Figure 2: panels (a) and (b); horizontal axis: time in seconds]

Figure 2: Cell data (control group in grey and treatment group in black): (a) all the
original observations; (b) sample means.